Skip to content

fix: allow SaveModel after Complete() in VowpalWabbitThreadedLearning#4913

Merged
JohnLangford merged 4 commits intomasterfrom
fix/threaded-learning-save-model
Mar 19, 2026
Merged

fix: allow SaveModel after Complete() in VowpalWabbitThreadedLearning#4913
JohnLangford merged 4 commits intomasterfrom
fix/threaded-learning-save-model

Conversation

@JohnLangford
Copy link
Copy Markdown
Member

Summary

Fixes #4911VowpalWabbitThreadedLearning was nearly impossible to use because SaveModel() and PerformanceStatistics had to be called before Complete(), which is the opposite of what users expect.

Root cause: Complete() immediately called syncActions.CompleteAdding(), closing the queue before the completion continuation (which actually drains it) had a chance to run. Any SaveModel() call after Complete() threw InvalidOperationException.

Changes:

  • Allow SaveModel/PerformanceStatistics after completion — detect post-completion state and save directly on the root VW instance. Uses TryAdd on the sync action list as a fallback for the race window between Complete() and the continuation.
  • Move CompleteAdding into the root completion continuation — uses a new atomic CompleteAndRemoveAll() on ConcurrentList<T>, so sync actions can be enqueued between calling Complete() and the continuation executing.
  • Add Flush() method — forces an AllReduce synchronization and drains pending sync actions on demand, without requiring the example count to hit a multiple of ExampleCountPerRun.

All changes are backward-compatible — existing code that calls SaveModel before Complete() continues to work.

The natural pattern now works:

await model.Complete();
await model.SaveModel(path);   // no longer throws
var stats = await model.PerformanceStatistics;  // also works

Mid-training saves also work without count alignment:

var saveTask = model.SaveModel(path);
await model.Flush();   // forces sync now
await saveTask;
// continue learning...

Test plan

  • TestSaveModelAfterComplete — save after Complete() produces a valid model file
  • TestPerformanceStatisticsAfterComplete — stats accessible after Complete()
  • TestFlushFlush() triggers sync and save mid-training, learning continues after
  • TestSaveModelBeforeComplete — original pattern (save before complete) still works
  • Existing TestAllReduce continues to pass (CI)

…WabbitThreadedLearning

Previously, calling SaveModel() or accessing PerformanceStatistics after
Complete() threw InvalidOperationException because Complete() immediately
closed the sync action queue. This forced users into a counter-intuitive
pattern of enqueuing saves before signaling completion.

Changes:
- Move CompleteAdding from Complete() into the root completion continuation
  using a new atomic CompleteAndRemoveAll(), so sync actions can be enqueued
  between the Complete() call and the continuation executing
- Make SaveModel/PerformanceStatistics detect post-completion state and
  operate directly on the root VW instance via TryAdd fallback
- Add Flush() method to force AllReduce sync on demand without waiting
  for ExampleCountPerRun threshold

Fixes #4911
Task.CompletedTask was introduced in .NET 5 and is not available in
netstandard2.0 which is the target framework for vw.parallel.
Improve XML doc comments on VowpalWabbitThreadedLearning to explain:
- Learn() enqueues and returns immediately (async dispatch, not blocking)
- Typical usage flow with code example (learn, complete, save)
- What Complete() guarantees (all examples learned, final allreduce done)
- That SaveModel/PerformanceStatistics work synchronously after Complete

Addresses feedback from #4911 about the TPL completion model being unclear.
@JohnLangford JohnLangford merged commit 45d45c5 into master Mar 19, 2026
84 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VowpalWabbitThreadedLearning is almost impossible to use

1 participant